By Agboola Quam.
This document explores the Titanic dataset, which contains demographics and passenger information for 891 of the 2224 passengers and crew on board the Titanic.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
What factors made people more likely to survive?
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import os
import gc
import math
import sklearn
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# load in the dataset into a pandas dataframe, print statistics
titanic = pd.read_csv('titanic.csv')
titanic.sample(10)
#1 Checking the dimension of the titanic dataframe
titanic.shape
We have 891 rows and 12 columns in our data frame
#2 Checking the data structure and whether there are missing values in the data frame
titanic.info()
#missing data
total = titanic.isnull().sum().sort_values(ascending=False)
percent = (titanic.isnull().sum()/titanic.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data
Let's analyse this to understand how to handle the missing data.
We'll consider that when more than 15% of the data is missing, we should delete the corresponding variable rather than try any trick to fill it in. By this rule there is just one variable (Cabin) that we should delete. The point is: will we miss this data? I don't think so. The cabin does not seem directly related to the survival of passengers (maybe that's also why the data is missing?).
In what concerns the remaining cases, the 'embarked' variable has just two cases of missing data. Since it is only two observations, we'll delete them and keep the variable.
The age variable also has missing values; we'll handle those by imputation below.
(1) We can see we have missing values in the Age column
(2) We can also see missing values in the Cabin and Embarked columns, although we won't be analyzing Cabin
For a column we keep, the options are to drop the specific rows with missing values or to fill them in, for example with the mean
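To make the two options concrete, here is a minimal sketch on a toy frame (not the actual dataset):

```python
import pandas as pd

# Toy frame standing in for the titanic data (illustration only)
df = pd.DataFrame({'age': [22.0, None, 38.0, None],
                   'fare': [7.25, 71.28, 8.05, 53.10]})

# Option 1: drop the rows that contain missing values
dropped = df.dropna()

# Option 2: fill the missing values with each column's mean
filled = df.fillna(df.mean())

print(len(dropped))              # 2 rows survive the drop
print(filled['age'].tolist())    # missing ages replaced by the mean, 30.0
```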
#dropping the cabin, passengerid, name and ticket columns from our dataset because they are not useful for this analysis
titanic.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)
#Changing all the column names to lowercase and underscore for consistency and easy data cleaning.
titanic.rename(columns={'Survived':'survived','Pclass':'pclass','Sex':'sex','Age':'age','SibSp':'sibsp','Parch':'parch','Fare':'fare','Embarked':'embarked'},inplace=True)
#Checking if the changes have been applied
titanic.head()
(1) passengerid = the passenger's id
(2) survived = whether the passengers survived or not (0-not survived, 1-survived)
(3) pclass = passenger ticket class (1-high class, 2-middle class, 3-low class)
(4) name, sex, age = the name, gender, and age of the passengers
(5) ticket, fare, cabin, embarked = the ticket number, the amount paid, the cabin, and the port of embarkation
(6) sibsp = the number of siblings and spouses each passenger had on board
(7) parch = the number of parents and children each passenger had on board
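As an aside, the integer codes above can be mapped to readable labels for plotting; a small sketch (the label text is my own choice):

```python
import pandas as pd

codes = pd.DataFrame({'survived': [0, 1, 1], 'pclass': [3, 1, 2]})

# map the integer codes from the data dictionary above to readable labels
codes['survived_label'] = codes['survived'].map({0: 'died', 1: 'survived'})
codes['pclass_label'] = codes['pclass'].map({1: 'high', 2: 'middle', 3: 'low'})
print(codes['pclass_label'].tolist())
```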
#checking histograms of the entire titanic dataset
titanic.hist(figsize=(10,8));
Age seems to have a lot of missing values: 714 entries instead of 891.
We can look at what those rows look like, because if they all share the same characteristics, the null values may come from the same group.
#we look at dataframe where the age is null using the histogram plot
titanic[titanic.age.isnull()].hist();
The rows with missing age look broadly similar to the general plot above, so we will fill in the missing values with the mean
#Filling in the missing values and rechecking the dataset info
titanic.fillna(titanic.mean(numeric_only=True), inplace=True)  # numeric_only so the string columns are left alone
titanic.info()
The ages were all filled with the mean.
But the other variable with missing data (embarked) cannot be filled with a mean value because it is not numeric; it contains letters, so it has no mean.
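A common alternative for a non-numeric column like this is to fill with the most frequent value (the mode) rather than drop rows; a sketch on a toy series:

```python
import pandas as pd

embarked = pd.Series(['S', 'C', None, 'S', 'Q', 'S'])

# mode()[0] is the most frequent category; here it is 'S'
filled = embarked.fillna(embarked.mode()[0])
print(filled.isnull().sum())   # no missing values remain
```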
#checking the rows that are missing in embarked column
titanic[titanic.embarked.isnull()]
It looks like the embarked column only has 2 missing values: 889 entries instead of 891.
Since it is just a small amount of missing data, we can simply drop those rows.
#dropping the rows in embarked that are missing and rechecking the data info
titanic.dropna(inplace=True);
titanic.info()
Now we have a complete dataset with no missing value
# Taking a look at some survival rates for babies
youngest_to_survive = titanic[titanic['survived'] == True]['age'].min()
youngest_to_die = titanic[titanic['survived'] == False]['age'].min()
oldest_to_survive = titanic[titanic['survived'] == True]['age'].max()
oldest_to_die = titanic[titanic['survived'] == False]['age'].max()
print('Youngest to survive :: ',youngest_to_survive)
print('Youngest to die :: ',youngest_to_die)
print('Oldest to survive :: ',oldest_to_survive)
print('Oldest to die :: ',oldest_to_die)
There are 889 passengers in the titanic dataset with 8 features (survived, pclass, sex, age, sibsp, parch, fare, and embarked). Most variables are numeric in nature, but survived, pclass, and sex are categorical variables with the following levels.
(1) survived = whether the passengers survived or not (0-not survived, 1-survived)
(2) pclass = passenger ticket class (1-high class, 2-middle class, 3-low class)
(3) sex = female , male
I'm most interested in figuring out what features are best for predicting the survival of passengers in the titanic dataset.
By looking at one variable at a time, we can build an intuition for how each variable is distributed before moving on to more complicated interactions between variables.
Let's start our exploration by looking at the main variables of interest: age and fare. Are the distributions skewed or symmetric? Unimodal or multimodal?
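Skewness can also be quantified numerically: pandas exposes the sample skew directly. A sketch on synthetic right-skewed data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# a right-skewed sample (log-normal), similar in shape to ticket fares
sample = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=1000))

# positive skew -> long right tail; near 0 -> roughly symmetric
print(sample.skew() > 0)
```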
#descriptive statistics summary of passengers age
titanic['age'].describe()
Observations
# univariate plot of passengers age
sb.distplot(titanic['age']);
#descriptive statistics summary of passengers ticket fare
titanic['fare'].describe()
Observations
# univariate plot of passengers ticket fare
sb.distplot(titanic['fare']);
#Histogram plot for numeric variables in the titanic dataset
fig = plt.figure(figsize = (8,8))
ax = fig.gca()
titanic.hist(ax=ax)
plt.show()
Fare - the majority of the passengers didn't pay much; the distribution is skewed to the right
Pclass - most people are in third class
Survived - more people died than survived
sibsp - most people didn't come with their siblings or spouses
parch - most people didn't come with parents or children
age - age is also somewhat right-skewed, with the majority between about 20 and 40
#Checking number of passengers that survived and didn't survive in the titanic dataset
pd.DataFrame(titanic.survived.value_counts())
we see that 549 passengers didn't survive, while 340 survived
#Checking counts of gender of the passengers
pd.DataFrame(titanic.sex.value_counts())
we see that most passengers are male (577), while 312 are female
#Checking counts of class of the passengers
pd.DataFrame(titanic.pclass.value_counts())
we see that most passengers are in low class (third class) with 491 passengers, 184 passengers are in middle class, while 214 passengers are in high (first) class
#Checking counts of age of the passengers
age=pd.DataFrame(titanic.age.value_counts())
#Checking counts in our dataset with bar chart
plt.figure(figsize=(10,4))
plt.subplot(1,2,1);
titanic.survived.value_counts().plot(kind='bar',title='Survival of Passengers',color=['C0','C1']);
plt.xlabel('Passengers')
plt.ylabel('Counts');
plt.subplot(1,2,2);
titanic.pclass.value_counts().plot(kind='bar',title='Class of the passengers',color=['C4','C5','C6']);
plt.xlabel('Class')
plt.ylabel('Counts');
titanic.sex.value_counts().plot(kind='bar',title='Gender of the passengers',color=['C2','C3']);
plt.xlabel('Gender')
plt.ylabel('Counts');
age.plot(kind = "bar", figsize = (20,8))
plt.ylabel("Number of Passengers")
plt.xlabel("Age")
plt.title("Number of Passengers by Age")
plt.show()
age_of_passengers = titanic.groupby("age").size()
# Bar chart of the number of passengers at each age, this time sorted by age
age_of_passengers.plot(kind = "bar", figsize = (20,8),color='red')
plt.ylabel("Number of Passengers")
plt.xlabel("Age")
plt.title("Number of Passengers by Age")
plt.show()
plt.figure(figsize=(10,4))
plt.subplot(1,2,1);
survive_values=[549,340]
survive_labels=["No","Yes"]
plt.axis("equal")
plt.title("Pie chart of passengers who survived and didn't survive the titanic")
plt.pie(survive_values,labels=survive_labels,radius=1.0,autopct='%0.1f%%',shadow=True,explode=[0,0.1],wedgeprops={'edgecolor':'black'});
plt.subplot(1,2,2);
gender_values=[312,577]
gender_labels=["Female","Male"]
plt.axis("equal")
plt.title("Pie chart showing the passengers gender counts")
plt.pie(gender_values,labels=gender_labels,radius=1.0,autopct='%0.1f%%',shadow=True,explode=[0,0.1],wedgeprops={'edgecolor':'black'});
class_values=[214,184,491]
class_labels=["high class","middle class","low class"]
plt.axis("equal")
plt.title("Pie chart showing the passengers class counts")
plt.pie(class_values,labels=class_labels,radius=1.0,autopct='%0.1f%%',shadow=True,explode=[0,0,0.1],wedgeprops={'edgecolor':'black'});
When investigating the age and embarked variables, we saw a completeness issue: both columns contained missing values, which I had to deal with.
I also noticed a consistency issue while assessing the variable names visually, so I changed the variable names to lowercase for consistency.
To start off with, I want to look at the pairwise correlations present between features in the data.
Through these bivariate plots, we can learn how changes in one variable might affect the other, and identify clusters and patterns in the dataset.
numeric_vars = ['age', 'fare']
categoric_vars = ['survived', 'pclass', 'sex','sibsp','parch','embarked']
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(titanic[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.show()
#survival correlation matrix
#correlation matrix
corrmat = titanic.corr(numeric_only=True)  # restrict to the numeric columns
f, ax = plt.subplots(figsize=(12, 9))
k = 8 #number of variables for heatmap
cols = corrmat.nlargest(k, 'survived')['survived'].index
cm = np.corrcoef(titanic[cols].values.T)
sb.set(font_scale=1.25)
hm = sb.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
# plot matrix: checking relationship between the numerical variables in the dataset
g = sb.PairGrid(data = titanic, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter);
It turns out the numeric variables (age and fare) are only weakly correlated with one another
#using heat map
#Is there a relationships between the price of tickets in dollars and the age of the passengers
plt.hist2d(data=titanic,x='age',y='fare',cmin=0.5,cmap='viridis_r');
plt.colorbar()
plt.xlabel('age of passengers')
plt.ylabel('amount paid($)');
Box plots are used to show overall patterns of response for a group. They provide a useful way to visualise the range and other characteristics of responses for a large group, and they allow comparing groups of different sizes.
The unquestionable advantage of the violin plot over the box plot is that, aside from showing the abovementioned statistics, it also shows the entire distribution of the data. This is of interest especially when dealing with multimodal data, i.e., a distribution with more than one peak.
I used box plots to check the relationship between qualitative and quantitative variables in the titanic dataset. Box plots do a fine job of summarizing the data, but some distributional details can get lost; those can be seen with the violin plot.
#Plotting a box plots for relationship between quantitative(fare) and qualitative variable(survived)
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='survived',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of fare vs survived');
We can see from the box plot that passengers who survived tended to pay higher fares than those who didn't
#Plotting a violin plots for relationship between quantitative(fare) and qualitative variable(survived)
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='survived',y='fare',color=base_color);
plt.xticks(rotation=15);
The violin plot gives a more detailed picture: those who didn't survive mostly paid lower ticket fares, while those who paid more had a higher chance of survival
#Plotting a box plots for relationship between quantitative(fare) and qualitative variable(pclass)
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='pclass',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of fare vs pclass');
We can see from the box plot that high (first) class tickets cost the most, while low (third) class tickets cost the least
#Plotting a violin plot for the relationship between quantitative(fare) and qualitative variable(pclass)
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='pclass',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of fare vs pclass');
#Plotting a box plots for relationship between quantitative(fare) and qualitative variable(sex)
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='sex',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of fare vs sex');
We can see from the box plot that female passengers tended to pay higher fares than male passengers
#Plotting a violin plots for relationship between quantitative(fare) and qualitative variable(sex)
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='sex',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of fare vs sex');
#Plotting a box plots for relationship between quantitative(age) and qualitative variable(survived)
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='survived',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of age vs survived');
#Plotting a violin plots for relationship between quantitative(age) and qualitative variable(survived)
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='survived',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of age vs survived');
#Plotting a box plots for relationship between quantitative(age) and qualitative variable(pclass)
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='pclass',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of age vs pclass');
#Plotting a violin plot for the relationship between quantitative(age) and qualitative variable(pclass)
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='pclass',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of age vs pclass');
#Plotting a box plot for the relationship between quantitative(age) and qualitative variable(sex)
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='sex',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of sex vs age');
#Plotting a violin plot for the relationship between quantitative(age) and qualitative variable(sex)
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='sex',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of age vs sex');
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.
def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x, y, color = default_color)
plt.figure(figsize = [90, 90])
g = sb.PairGrid(data = titanic, y_vars = ['fare', 'age'], x_vars = categoric_vars,
size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();
survive=titanic.survived==True
died=titanic.survived==False
Comparing the distribution of Age for the passengers who survived and didn't survive
titanic.age[survive].hist(alpha=0.5,label='survived')
titanic.age[died].hist(alpha=0.5,label='died')
plt.legend()
plt.xlabel('age')
plt.ylabel('survival counts')
plt.title('Histogram distribution of age vs survival');
It does look like the really young children have a higher chance of surviving than other ages
Showing relationships between two categorical variables in the titanic dataset
Comparing the distribution of passengers gender who survived and didn't survive
#grouping two categorical variables together(sex and survived)
bygender=titanic.groupby("sex").survived.value_counts()
bygender
From the result below, 340 out of 889 passengers survived the titanic, a rate of 38.2%; female survivors make up 26.0% of all passengers, more than the male survivors at 12.3%.
Meanwhile 549 out of 889 passengers did not survive (61.8%); among those who died, males (85.2%) far outnumber females (14.8%).
That is, females were more likely to survive than males, i.e. male passengers were more likely not to survive the titanic than female passengers
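These percentages can also be read off directly as survival rates, since the mean of a 0/1 column is the fraction who survived; a sketch on a toy frame with the same column names:

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['female', 'female', 'male', 'male', 'male'],
                    'survived': [1, 0, 1, 0, 0]})

# mean of a 0/1 column = survival rate within each gender group
rates = toy.groupby('sex')['survived'].mean()
print(rates['female'], rates['male'])   # 0.5 vs 1/3 in this toy data
```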
#plotting stacked bar chart of gender vs survived
bygender.unstack().plot(kind='bar',stacked=True);
plt.title('Gender vs Survived',fontsize=18)
plt.xlabel('gender',fontsize=18)
plt.ylabel('passengers',fontsize=18);
This plot shows that female passengers were more likely to survive the titanic than male passengers (i.e. males were more likely to not survive). There are clearly more female survivors than male survivors
#plotting clustered bar chart of gender vs survived
bygender=bygender.reset_index(name='count')
bygender.pivot(index='sex',columns='survived',values='count')
sb.countplot(data=titanic,x='sex',hue='survived')
plt.xticks(rotation=15)
plt.title('Gender vs Survived');
The clustered bar chart shows the same thing: more female survivors than male survivors, and far more male deaths than female deaths
Comparing the distribution of passengers gender and the class of the tickets they paid for
#grouping two categorical variables together(sex and pclass)
byc=titanic.groupby("sex").pclass.value_counts()
byc
Note, male passengers (64.9%) outnumber female passengers (35.1%) in the titanic dataset.
From the result below, 214 out of 889 passengers (24.1%) were in first class: males (13.7% of all passengers) more than females (10.3%).
184 out of 889 passengers (20.7%) were in second class: males (12.1%) more than females (8.5%).
491 out of 889 passengers (55.2%) were in third class: males (39.0%) more than females (16.2%).
In general, we have more passengers in low class (55.2%) than in middle class (20.7%) and high class (24.1%).
#plotting stacked bar chart of gender vs pclass
byc.unstack().plot(kind='bar',stacked=True);
plt.title('Gender vs pclass',fontsize=18)
plt.xlabel('gender',fontsize=18)
plt.ylabel('passengers',fontsize=18);
From the plot, it does look like a larger share of the female passengers travelled in the more expensive first class
#plotting clustered bar chart of gender vs pclass
byc=byc.reset_index(name='count')
byc.pivot(index='sex',columns='pclass',values='count')
sb.countplot(data=titanic,x='sex',hue='pclass')
plt.xticks(rotation=15)
plt.title('Gender vs Pclass');
Comparing the distribution of passengers who survived and didn't survive and the class of the tickets they paid for
#grouping two categorical variables together(survived and pclass)
bys=titanic.groupby("survived").pclass.value_counts()
bys
In the result above, 134 out of 889 passengers in first class survived the titanic (15.1%), while 80 first-class passengers didn't (9.0%).
87 out of 889 passengers in second class survived (9.8%), while 97 second-class passengers didn't (10.9%).
119 out of 889 passengers in third class survived (13.4%), while 372 third-class passengers didn't (41.9%).
In general, passengers in high class were more likely to survive than passengers in middle and low class: first class is the only class where more passengers survived than died.
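pd.crosstab can produce such within-class proportions in one step; a sketch with made-up counts:

```python
import pandas as pd

toy = pd.DataFrame({'pclass': [1, 1, 3, 3, 3, 2],
                    'survived': [1, 0, 0, 0, 1, 1]})

# normalize='index' turns raw counts into within-class survival proportions
table = pd.crosstab(toy['pclass'], toy['survived'], normalize='index')
print(table)
```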
#plotting stacked bar chart of survived vs pclass
bys.unstack().plot(kind='bar',stacked=True);
plt.title('survived vs pclass',fontsize=18)
plt.xlabel('survived',fontsize=18)
plt.ylabel('passengers',fontsize=18);
#plotting clustered bar chart of survived vs pclass
bys=bys.reset_index(name='count')
bys.pivot(index='survived',columns='pclass',values='count')
sb.countplot(data=titanic,x='survived',hue='pclass')
plt.xticks(rotation=15)
plt.title('Survived vs Pclass');
Comparing the distribution of having family (siblings and spouse) on board is associated with survival
#grouping two variables together(survived and sibsp)
bysib=titanic.groupby("sibsp").survived.value_counts()
#plotting stacked bar chart of survived vs siblings/spouses
bysib.unstack().plot(kind='bar',stacked=True);
plt.title('survived vs siblings/spouses',fontsize=18)
plt.xlabel('siblings/spouse',fontsize=18)
plt.ylabel('passengers',fontsize=18);
From our result, passengers travelling with one or two siblings/spouses appear to survive at a higher rate than those travelling alone or with many.
Comparing the distribution of having family (parents and children) on board is associated with survival
#grouping two variables together(survived and parch)
bypar=titanic.groupby("parch").survived.value_counts()
#plotting stacked bar chart of survived vs parent/children
bypar.unstack().plot(kind='bar',stacked=True);
plt.title('survived vs parents/children',fontsize=18)
plt.xlabel('parents/children',fontsize=18)
plt.ylabel('passengers',fontsize=18);
From our result, passengers with big families didn't appear to survive as well as those travelling alone or with a small family
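One way to make "family size" concrete is a derived column, sibsp + parch + 1 (a feature I'm introducing here purely for illustration; it is not part of the analysis above):

```python
import pandas as pd

toy = pd.DataFrame({'sibsp': [0, 1, 4], 'parch': [0, 2, 2], 'survived': [0, 1, 0]})

# total people in the passenger's family group, including the passenger
toy['family_size'] = toy['sibsp'] + toy['parch'] + 1
print(toy['family_size'].tolist())
```

Grouping survival rates by this column would summarize the sibsp and parch plots above in a single view.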
Checking whether where the passengers embarked is associated with their chances of survival
#grouping two variables together(survived and embarked)
byemb=titanic.groupby("embarked").survived.value_counts()
byemb
#plotting stacked bar chart of survived vs embarked
byemb.unstack().plot(kind='bar',stacked=True);
plt.title('survived vs embarked',fontsize=18)
plt.xlabel('embarked',fontsize=18)
plt.ylabel('passengers',fontsize=18);
In general, passengers that embarked from Cherbourg were more likely to survive than those from the other ports of embarkation. Therefore, embarked seems to have some association with the chance of survival
#scatterplot
sb.set()
cols = ['fare', 'age', 'survived', 'pclass', 'parch', 'sibsp']
sb.pairplot(titanic[cols], size = 2.5)
plt.show();
sb.lmplot('age','survived',data=titanic);
sb.lmplot('age','survived',data=titanic,hue='pclass');
This plot shows that older passengers are less likely to survive.
sb.lmplot('age','survived',data=titanic,hue='sex');
The number of passengers who boarded at Southampton is larger than at Cherbourg and Queenstown, but Cherbourg passengers were more likely to survive than Southampton passengers. So there is a chance that Embarked helps in prediction.
Visualizing three or more variables
g = sb.FacetGrid(titanic, col="pclass", hue="survived")
g.map(plt.scatter, "fare", "age", alpha=.7)
g.add_legend();
g = sb.FacetGrid(titanic, row="survived", col="pclass", margin_titles=True)
g.map(sb.regplot, "age", "fare", color=".3", fit_reg=False, x_jitter=.1);
g = sb.FacetGrid(titanic, row="survived", col="parch", margin_titles=True)
g.map(sb.regplot, "age", "fare", color=".3", fit_reg=False, x_jitter=.1);
g = sb.FacetGrid(titanic, row="survived", col="sibsp", margin_titles=True)
g.map(sb.regplot, "age", "fare", color=".3", fit_reg=False, x_jitter=.1);
ttype_markers=[['male','o'],['female','^']]
for ttype, marker in ttype_markers:
    plot_data=titanic.loc[titanic['sex']==ttype]
    sb.regplot(data=plot_data,x='fare',y='age',x_jitter=0.04,marker=marker,fit_reg=False);
plt.xlabel('Passengers Fare')
plt.ylabel('Passengers Age')
plt.title('AGE vs FARE vs SEX')
plt.legend(['male','female']);
#Class and gender wise segregation of passengers
sb.factorplot('survived', col='pclass', hue='sex', data=titanic, kind='count', size=7, aspect=.8)
plt.subplots_adjust(top=0.9)
sb.pointplot(data = titanic, x = 'survived', y = 'fare', hue = 'pclass',
palette = 'Greens', linestyles = '', dodge = 0.4)
plt.title('passengers fare across survived and pclass')
plt.ylabel('Mean Fare ($)')
plt.yscale('log')
plt.yticks([20, 40, 60, 80, 100, 120, 140],[20, '40', '60', '80', '100', '120', '140'])
plt.show();
facet_grid = sb.FacetGrid(titanic, col='survived', row='pclass', size=2.2, aspect=1.6)
facet_grid.map(plt.hist, 'age', alpha=.5, bins=20)
facet_grid.add_legend();
This plot shows that passengers in a higher class are more likely to survive than passengers in a lower class
#Log transforming the fare variable because its distribution is skewed
#the log transform reduces the right skew
fare_log = np.log10(titanic['fare'] + 1)  # +1 because some fares are 0 and log10(0) is undefined
age_log = np.log10(titanic['age'])
# multivariate plot of fare by age and survived
#Using faceted scatterplot
g=sb.FacetGrid(data=titanic,col='survived',margin_titles=True,col_wrap=3)
g.map(plt.scatter,'age','fare');
# multivariate plot of fare by pclass and survived
#Fare by survived and pclass using a point plot
sb.pointplot(data=titanic,x='pclass',y='fare',hue='survived',ci='sd',linestyles=" ",dodge=True);
plt.xticks(rotation=15)
plt.ylabel('Average fare($)');
# multivariate plot of fare by pclass and survived
#Fare by survived and pclass using box plot
sb.boxplot(x='pclass',y='fare',hue='survived',data=titanic);
sb.set(style="ticks")
plt.xticks(rotation=15)
plt.ylabel('Average fare($)');
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.title("Box plot of fare by pclass and survived")
plt.tight_layout()
plt.savefig("place_legend_outside_plot_Seaborn_boxplot.png",
format='png',dpi=150)
It can be seen that there is a relationship between fare, pclass and survival
It can be seen through the box plot that people in high class are more likely to survive than middle and low class
#Making 3 copies of titanic data set for the three models
titanic_rf=titanic.copy()
titanic_lda=titanic.copy()
titanic_log=titanic.copy()
Random Forest
# Convert string values to int values for the ease of prediction
titanic_rf = pd.get_dummies(titanic_rf)
titanic_rf.head()
# Split data into training and testing set
X = titanic_rf.iloc[:,1:]
Y = titanic_rf["survived"]
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size = 0.2)
# Random Forest Classification Algorithm applied to train data
RF = RandomForestClassifier(n_jobs = 2)
RF.fit(X_train, Y_train)
RF.score(X_train, Y_train)
# Predict Random Forest Algorithm on Test Data
predictions_RF = RF.predict(X_test)
# Print Accuracy Score for Random Forest Algorithm
acc=accuracy_score(Y_test, predictions_RF)
print('Accuracy :: ',acc)
# Classification Report of Prediction
print(classification_report(Y_test, predictions_RF))
Here, it can be observed that 79% (78% not survived and 79% survived) of the test data is predicted precisely. Columns 0 and 1 represent "not survived" and "survived" respectively.
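For reference, the per-class precision and recall in such a report come straight from the confusion-matrix counts; a sketch with made-up counts (not the actual run):

```python
import numpy as np

# rows = actual class, columns = predicted class, for classes [0, 1]
cm = np.array([[90, 20],
               [15, 53]])

# precision for class 1 = true positives / everything predicted as class 1
precision_1 = cm[1, 1] / cm[:, 1].sum()
# recall for class 1 = true positives / everything actually in class 1
recall_1 = cm[1, 1] / cm[1, :].sum()
print(round(precision_1, 2), round(recall_1, 2))
```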
# Confusion Matrix for predictions made
confusion_matrix1 = confusion_matrix(Y_test, predictions_RF)
print(confusion_matrix1)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sb.heatmap(pd.DataFrame(confusion_matrix1), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Random Forest Classification Algorithm', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label');
rand_predict = RF.predict(X_test)
rand_predict
Final_predictions = pd.DataFrame({'survived':rand_predict})
Final_predictions.to_csv('sample_submission.csv',index=False)
Final_predictions.head()
Feature Importance
Another great quality of random forests is that they make it very easy to measure the relative importance of each feature. Scikit-learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1. We will access this below:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(RF.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(15)
importances.plot.bar();
Conclusion:
The embarked categories, sibsp, and parch don't play a significant role in our random forest classifier's prediction process. Because of that I will drop them from the dataset and train the classifier again. We could also remove more or fewer features, but that would need a more detailed investigation of each feature's effect on our model. For now it seems fine to remove the embarked categories, sibsp, and parch.
Training Random Forest Again
# drop all the unimportant columns in one call; chaining separate
# single-column drops from titanic_rf would discard each previous result
train_df = titanic_rf.drop(["embarked_S", "embarked_Q", "embarked_C", "parch", "sibsp"], axis=1)
# Random Forest on the reduced feature set
X = train_df.iloc[:,1:]
Y = train_df["survived"]
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size = 0.2)
random_forest = RandomForestClassifier(n_estimators=100, oob_score = True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(round(acc_random_forest, 2), "%")
Our random forest model predicts about as well as it did before. A general rule is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa. But our data looks fine for now and doesn't have too many features.
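A more robust check for overfitting than a single train/test split is cross-validation. A self-contained sketch on synthetic data (in the notebook one would pass X and Y instead of the stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the titanic feature matrix (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# each of the 5 scores is accuracy on a held-out fold; a large gap between
# these and the (often near-perfect) training score signals overfitting
scores = cross_val_score(rf, X_demo, y_demo, cv=5)
print(len(scores), scores.mean() > 0.7)
```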
# Predict Random Forest Algorithm on Test Data
predict_RF = random_forest.predict(X_test)
# Print Accuracy Score for Random Forest Algorithm
acc2=accuracy_score(Y_test, predict_RF)
print('Accuracy :: ',acc2)
# Classification Report of Prediction
print(classification_report(Y_test, predict_RF))
# Confusion Matrix for predictions made
confusion_matrix11 = confusion_matrix(Y_test, predict_RF)
print(confusion_matrix11)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sb.heatmap(pd.DataFrame(confusion_matrix11), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Random Forest Classification Algorithm', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label');
rand_predict2 = random_forest.predict(X_test)
rand_predict2
Final_predictions2 = pd.DataFrame({'survived':rand_predict2})
Final_predictions2.to_csv('sample_submission.csv',index=False)
Final_predictions2.head()
# Convert string values to int values for the ease of prediction
titanic_lda = pd.get_dummies(titanic_lda)
titanic_lda.head()
## Drop one of the dummy variables for each column
titanic_lda.drop(['sex_female','embarked_C'],axis=1,inplace=True)
# Split data into training and testing set
X1 = titanic_lda.iloc[:,1:]
Y1 = titanic_lda["survived"]
X1_train, X1_test, Y1_train, Y1_test = model_selection.train_test_split(X1, Y1, test_size = 0.2)
# LDA applied to train data
lda = LinearDiscriminantAnalysis()
lda.fit(X1_train,Y1_train)
pred_lda = lda.predict(X1_test)
# Print Accuracy Score for Linear Discriminant Analysis
acco=accuracy_score(Y1_test, pred_lda)
print('Accuracy :: ',acco)
# Classification Report of Prediction
print(classification_report(Y1_test, pred_lda))
Here, it can be observed that about 81% of the test data is predicted correctly (82% precision for "not survived" and 80% for "survived"). Labels 0 and 1 represent "not survived" and "survived" respectively.
# Confusion Matrix for predictions made
confusion_matrix2 = confusion_matrix(Y1_test,pred_lda)
print(confusion_matrix2)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sb.heatmap(pd.DataFrame(confusion_matrix2), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Linear Discriminant Analysis', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label');
titanic_log.head()
# Convert string values to int values for the ease of prediction
titanic_log = pd.get_dummies(titanic_log)
titanic_log.head()
## Drop one of the dummy variables for each column
titanic_log.drop(['sex_female','embarked_C'],axis=1,inplace=True)
titanic_log.head()
X=titanic_log.drop("survived",axis=1)
y=titanic_log['survived']
import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
#checking the logistic model coefficient
result.params
# Exponentiate the coefficients (log odds) to get the odds ratios
np.exp(result.params)
#Checking if variables are significant with their pvalues
result.pvalues
# odds ratios and 95% CI
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
np.exp(conf)
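To see why exponentiating works: a logistic regression coefficient is a change in log odds, so `exp(coefficient)` is the multiplicative change in the odds per one-unit increase in the predictor. A tiny worked example with a hypothetical coefficient (not one of the notebook's actual estimates):

```python
import numpy as np

# Hypothetical logistic-regression coefficient (log-odds scale).
beta = 0.5

# A one-unit increase in this predictor multiplies the odds of survival
# by exp(beta) ~ 1.65, i.e. a 65% increase in the odds.
odds_ratio = np.exp(beta)
print(round(odds_ratio, 2))
```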
from sklearn.linear_model import LogisticRegression
# Fit the sklearn logistic model on the same features used for the
# statsmodels fit above; a fresh split keeps train and test consistent.
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
logmodel = LogisticRegression(max_iter=1000)  # raise max_iter to ensure convergence
logmodel.fit(X_train, Y_train)
y_pred = logmodel.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logmodel.score(X_test, Y_test)))
from sklearn.metrics import confusion_matrix
confusion_matrix3 = confusion_matrix(Y_test, y_pred)
print(confusion_matrix3)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sb.heatmap(pd.DataFrame(confusion_matrix3), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Logistic Regression Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
The result is telling us that we have 83+51 correct predictions and 16+28 incorrect predictions.
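The arithmetic behind that interpretation: correct predictions sit on the diagonal of the confusion matrix, so accuracy is the trace divided by the total count. A minimal sketch with made-up toy labels (not the notebook's actual counts):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels standing in for Y_test / y_pred.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_hat)
# Accuracy = (true negatives + true positives) / total predictions.
acc = np.trace(cm) / cm.sum()
print(cm)
print(acc)
```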
# import the metrics class
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(Y_test, y_pred))
print("Precision:",metrics.precision_score(Y_test, y_pred))
print("Recall:",metrics.recall_score(Y_test, y_pred))
from sklearn.metrics import classification_report
print(classification_report(Y_test, y_pred))
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Use predicted probabilities (not hard class labels) for a proper AUC
logit_roc_auc = roc_auc_score(Y_test, logmodel.predict_proba(X_test)[:,1])
fpr, tpr, thresholds = roc_curve(Y_test, logmodel.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
results = pd.DataFrame({
'Model': ['Random Forest', 'LDA', 'Logistic Regression'],'Score': [acc2, acco, metrics.accuracy_score(Y_test, y_pred)]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(9)
As can be seen, Linear Discriminant Analysis takes first place. To validate this ranking more robustly, we use cross validation.
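A minimal cross-validation sketch for the LDA model. Synthetic stand-in data is used here so the snippet is self-contained; in the notebook this would run on X1 and Y1 as built above:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the titanic features/labels (X1, Y1).
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# 10-fold CV: train on 9 folds, score accuracy on the held-out fold each time.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=10, scoring="accuracy")
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```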
Random Forest is useful for both classification and regression.
It creates a multitude of (individually weak) trees, each trained on a different random subset of the rows and input variables, and returns whichever prediction is made by the most trees.
This helps avoid overfitting, a problem that occurs when a model is so tightly fitted to arbitrary correlations in the training data that it performs poorly on test data.
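The majority-vote mechanism can be observed directly: a fitted forest exposes its individual trees via `estimators_`, and its class prediction normally agrees with the per-tree majority. A minimal sketch on synthetic data (note that scikit-learn actually averages the trees' predicted probabilities, i.e. soft voting, which usually but not always matches the hard majority vote):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

sample = X[:1]
# Collect each individual tree's vote for this sample.
votes = np.array([tree.predict(sample)[0] for tree in forest.estimators_])
majority = np.bincount(votes.astype(int)).argmax()
print("majority vote:", majority, "forest prediction:", forest.predict(sample)[0])
```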
Given new data on a Titanic passenger, this Random Forest model can predict whether the person would survive, with an accuracy of 78.65%.
In the dataset we find that the independent variables are not normally distributed, which is a fundamental assumption of LDA.
Even so, LDA offers better accuracy and recall than Random Forest.
Given new data on a Titanic passenger, this LDA model can predict whether the person would survive, with an accuracy of 82.02%.
I extended my investigation of fare against pclass and survival in this section by looking at the impact of fare and pclass on the chance of survival. The multivariate exploration showed that higher fares are indeed associated with higher survival rates.
I also looked at fare against age and survival, examining the impact of fare and age on the chance of survival. The multivariate exploration showed that young passengers had a higher chance of surviving than older passengers.
During data preprocessing, we imputed missing values, converted categorical features into numeric ones, grouped values into categories, and created a few new features. Afterwards we trained 3 different machine learning models, picked one of them (LDA), and applied cross validation to it. Then we discussed how LDA works, took a look at the importance it assigns to the different features, and tuned its performance by optimizing its hyperparameter values. Lastly, we looked at its confusion matrix and computed the model's precision, recall, and f-score.